This dashboard examines customer loyalty programs in the airline industry. Our dataset, sourced from Northern Lights Air (NLA), a fictional Canadian airline, captures customer engagement, flight activity, and demographic information.
Link: https://mavenanalytics.io/data-playground?search=airline+loyalty+program Note: Please scroll to the bottom of the page to find the airline loyalty program dataset.
This dashboard aims to explore relationships between predictor variables and the total number of flights in airline loyalty programs. We will analyze the data using multiple regression, ridge regression, non-linear regression, and classification techniques.
Here is a brief overview of the variables in the dataset:

| Variable | Description |
|---|---|
| Loyalty.Number | Unique identifier for loyalty program members |
| Year | Year of flight activity |
| Month | Month of flight activity |
| Total.Flights | Total number of flights |
| Distance | Distance traveled in flights (in miles) |
| Points.Accumulated | Loyalty points accumulated |
| Points.Redeemed | Loyalty points redeemed |
| Dollar.Cost.Points.Redeemed | Cost of loyalty points redeemed (in dollars) |
| Country | Country of residence |
| Province | Province or state of residence |
| City | City of residence |
| Postal.Code | Postal code of residence |
| Gender | Gender of member |
| Education | Level of education |
| Salary | Annual salary of member |
| Marital.Status | Marital status of member |
| Loyalty.Card | Type of loyalty card |
| CLV | Customer lifetime value |
| Enrollment.Type | Type of enrollment (e.g., standard, premium) |
| Enrollment.Year | Year of enrollment |
| Enrollment.Month | Month of enrollment |
| Cancellation.Year | Year of cancellation (if applicable) |
| Cancellation.Month | Month of cancellation (if applicable) |
This section runs a Naive Bayes model that uses details about loyalty program members to predict their education level. It draws on information such as where they live, their gender, and how they interact with the loyalty program to make its predictions. This helps us understand whether certain traits are common among members with different education backgrounds, with the goal of tailoring the loyalty program to fit different members' needs.
How well can demographic factors, geographic factors, and loyalty program status predict the education level of participants in an airline loyalty rewards program?
library(e1071)
library(dplyr)
library(caret)
data <- read.csv("/Users/jitpapneja/Google Drive/TCCC Old/Personal/merged_data_EP.csv")
data$Cancellation.Year <- NULL
data$Cancellation.Month <- NULL
data <- na.omit(data)
categorical_vars <- c("Country", "Province", "City", "Postal.Code", "Gender",
"Education", "Marital.Status", "Loyalty.Card", "Enrollment.Type")
data[categorical_vars] <- lapply(data[categorical_vars], factor)

Confusion Matrix and Statistics
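The code above ends with preprocessing; the model fit that produces the confusion matrix output was evidently elided. A minimal sketch of that step, using the `e1071` and `caret` functions already loaded (the 80/20 split proportion and the seed are assumptions, not taken from the original):

```r
# Assumed train/test split (proportion and seed are not given in the original)
set.seed(123)
idx   <- createDataPartition(data$Education, p = 0.8, list = FALSE)
train <- data[idx, ]
test  <- data[-idx, ]

# Naive Bayes with Education as the class and the remaining columns as predictors
nb_model <- naiveBayes(Education ~ ., data = train)
nb_pred  <- predict(nb_model, newdata = test)

# Confusion matrix and per-class statistics, as reported below
confusionMatrix(data = nb_pred, reference = test$Education)
```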
Rows are predictions; columns are the reference (actual) classes.

| Prediction | Bachelor | Doctor | High School or Below | Master |
|---|---|---|---|---|
| Bachelor | 48766 | 807 | 3643 | 1428 |
| Doctor | 40 | 2510 | 6 | 175 |
| High School or Below | 183 | 0 | 22 | 1 |
| Master | 107 | 171 | 0 | 813 |
Overall Statistics:
- Accuracy: 0.8882
- 95% CI: (0.8856, 0.8907)
- No Information Rate: 0.8368
- P-Value [Acc > NIR]: < 2.2e-16
- Kappa: 0.4845
- McNemar's Test P-Value: < 2.2e-16
Statistics by Class:

| Statistic | Bachelor | Doctor | High School or Below | Master |
|---|---|---|---|---|
| Sensitivity | 0.9933 | 0.71961 | 0.005993 | 0.33637 |
| Specificity | 0.3862 | 0.99600 | 0.996655 | 0.99506 |
| Pos Pred Value | 0.8924 | 0.91908 | 0.106796 | 0.74519 |
| Neg Pred Value | 0.9181 | 0.98252 | 0.937588 | 0.97214 |
| Prevalence | 0.8368 | 0.05945 | 0.062568 | 0.04120 |
| Detection Rate | 0.8312 | 0.04278 | 0.000375 | 0.01386 |
| Detection Prevalence | 0.9313 | 0.04655 | 0.003511 | 0.01859 |
| Balanced Accuracy | 0.6897 | 0.85780 | 0.501324 | 0.66571 |
The accuracy of the Naive Bayes classifier is 0.8882, indicating that approximately 88.82% of the predictions made by the model are correct.
The confusion matrix and its accompanying metrics show how well the Naive Bayes model classifies each education level: its rates of true positives and true negatives, and the balance it strikes between precision and recall. These statistics are essential for iteratively improving the model's predictive accuracy, and its ability to differentiate between educational levels can support tailored communication strategies and targeted marketing initiatives.
The statistics portion of the confusion matrix output reports several important metrics:
- Sensitivity (True Positive Rate): the proportion of actual positives that are correctly identified.
- Specificity (True Negative Rate): the proportion of actual negatives that are correctly identified.
- Positive Predictive Value: the proportion of positive identifications that were actually correct.
- Negative Predictive Value: the proportion of negative identifications that were actually correct.
- Prevalence: how common the positive class is in the dataset.
- Detection Rate: the proportion of all instances that were correctly identified as positive.
- Detection Prevalence: the proportion of all instances that were predicted as positive.
- Balanced Accuracy: the average of sensitivity and specificity, balancing performance on positives and negatives.
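Each definition above can be checked against the Bachelor column of the confusion matrix output; a small worked example with the counts copied from that matrix (the 58672 total is the sum of all its cells):

```r
# Bachelor class, one-vs-rest: counts taken from the confusion matrix above
TP <- 48766                   # predicted Bachelor, actually Bachelor
FN <- 40 + 183 + 107          # actual Bachelor, predicted as another class
FP <- 807 + 3643 + 1428       # predicted Bachelor, actually another class
TN <- 58672 - TP - FN - FP    # everything else (58672 total observations)

sensitivity <- TP / (TP + FN)                  # ~0.993, matches the output
specificity <- TN / (TN + FP)                  # ~0.386
pos_pred    <- TP / (TP + FP)                  # ~0.892
balanced    <- (sensitivity + specificity) / 2 # ~0.690
```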
From the confusion matrix and accuracy produced by the Naive Bayes model, we can see that the model predicts the correct level of education for a loyalty program member about 88.8% of the time, based on multiple predictors, e.g. their salary, which card they have, and when they enrolled. Some takeaways from the statistics portion of the confusion matrix: the high sensitivity (0.993) for the Bachelor class indicates that the model did very well at identifying true positives for that class, and the model also did very well on the Doctor class, as seen from the positive predictive value (0.919) and negative predictive value (0.982). Overall, the model did a good job of predicting the education level of the loyalty program members.
This section utilizes logistic regression to explore whether the year of manufacture can predict the type of electric vehicle, here the two categories battery electric vehicles (BEVs) and plug-in hybrid electric vehicles (PHEVs). Logistic regression is apt for this task because it models the probability that a given vehicle falls into a specific category based on its year of manufacture: the fitted logistic curve outputs a probability score between 0 and 1 as a function of production year. This approach not only aids in predicting the type of electric vehicle but also helps in understanding how the characteristics of electric vehicles have evolved over the years, providing valuable insights for manufacturers planning future models and for analysts studying automotive trends.
Can we use the year an electric vehicle was made to predict its type (BEV vs. PHEV)?
# Read the data (caret is loaded for createDataPartition below)
library(caret)
electric = read.csv("/Users/jitpapneja/Google Drive/TCCC Old/Personal/Electric_Vehicle_Population_Data (1).csv")
# View the unique values in Electric Vehicle Type to check the encoding
unique(electric$Electric.Vehicle.Type)
# Encoding 'Electric Vehicle Type' as a factor
electric$Electric.Vehicle.Type <- as.factor(electric$Electric.Vehicle.Type)
# Splitting the dataset into training and testing sets
set.seed(42)
training_rows <- createDataPartition(electric$Electric.Vehicle.Type, p = 0.8, list = FALSE)
train_data <- electric[training_rows, ]
test_data <- electric[-training_rows, ]

# Fit the logistic regression model
model <- glm(Electric.Vehicle.Type ~ Model.Year, data = train_data, family = binomial())
# Predicting on the test set
predictions <- predict(model, test_data, type = "response")
predicted_classes <- ifelse(predictions > 0.5, 1, 0)
# Actual classes
actual_classes <- as.numeric(test_data$Electric.Vehicle.Type) - 1

# Calculate accuracy
accuracy <- mean(predicted_classes == actual_classes)
print(paste("Accuracy on test set:", accuracy))

[1] "Accuracy on test set: 0.782447361763135"
Call:
glm(formula = Electric.Vehicle.Type ~ Model.Year, family = binomial(),
data = train_data)
Coefficients:
Estimate Std. Error z value Pr(>|z|)
(Intercept) 249.659957 4.082609 61.15 <2e-16 ***
Model.Year -0.124215 0.002021 -61.46 <2e-16 ***
---
Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
(Dispersion parameter for binomial family taken to be 1)
Null deviance: 148984 on 142292 degrees of freedom
Residual deviance: 145284 on 142291 degrees of freedom
AIC: 145288
Number of Fisher Scoring iterations: 4
After looking at the model we can see that the p-value for model year is statistically significant which means that it is a viable variable to look at. The number of Fisher Scoring iterations is 4 which tells us that the model took 4 iterations to reach the final model. After calculating the accuracy of the model we see that we get an accuracy of 0.782 which indicates that our model predicted the correct electric vehicle type based on the model year 78% of the time.
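Since the fitted model is binary, the coefficients above can also be read on the probability scale. A small sketch using the estimates from the summary (the class coding, with the second alphabetical factor level PHEV treated as 1, is an assumption):

```r
# Coefficient estimates copied from the model summary above
intercept <- 249.659957
slope     <- -0.124215          # change in log-odds per model year

# Predicted probability of the "1" class (assumed to be PHEV) for a few years
years <- c(2000, 2010, 2020)
round(plogis(intercept + slope * years), 3)

# Odds-ratio interpretation: each extra model year multiplies the odds by
exp(slope)   # ~0.88, i.e. roughly a 12% drop in the odds per year
```

The negative slope means the predicted probability of the second class falls as Model.Year increases, consistent with the significant negative estimate in the summary.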
This section uses Ridge Regression to analyze factors that affect flight frequency among loyalty program customers, considering variables like income, past flights, and seasonal preferences. By applying a penalty to the regression coefficients, it minimizes errors from multicollinearity and balances the model’s complexity. The optimal penalty parameter, λ (lambda), is determined through cross-validation to ensure accuracy without overfitting. This analysis helps identify the main drivers of customer behavior, enabling the airline to optimize its loyalty program for better engagement and effectiveness.
What factors most significantly influence the frequency of flights taken by customers in a loyalty program, and how can these insights be used to optimize the program’s effectiveness and customer engagement?
library(glmnet)
data <- read.csv("/Users/jitpapneja/Google Drive/TCCC Old/Personal/merged_data_EP.csv")
# Predictor Variables and Target
predictors <- data[, c("Distance", "Points.Accumulated", "Points.Redeemed")]
target <- data$Total.Flights # Now predicting Total Flights
# Scaling predictors
predictors_scaled <- scale(predictors)
# Matrix format conversion
predictors_matrix <- as.matrix(predictors_scaled)
target_vector <- as.vector(target)

Optimal lambda: 0.178239
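The cross-validation call that produces the lambda, coefficient, and R-squared output below is not shown; a minimal sketch consistent with the reported Call (alpha = 0 selects the ridge penalty):

```r
# Ridge regression: alpha = 0 penalizes coefficient size without zeroing them
alpha  <- 0
cv_fit <- cv.glmnet(x = predictors_matrix, y = target_vector, alpha = alpha)

cat("Optimal lambda:", cv_fit$lambda.min, "\n")
coef(cv_fit, s = "lambda.min")   # coefficients at the best lambda
print(cv_fit)                    # MSE at lambda.min and lambda.1se

# R-squared on the training data at the optimal lambda
preds <- predict(cv_fit, newx = predictors_matrix, s = "lambda.min")
r2 <- 1 - sum((target_vector - preds)^2) /
          sum((target_vector - mean(target_vector))^2)
cat("R-squared on the training data:", r2, "\n")
```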
4 x 1 sparse Matrix of class "dgCMatrix"
s0
(Intercept) 1.29488772
Distance 0.97198159
Points.Accumulated 0.72855024
Points.Redeemed 0.07320769
Call: cv.glmnet(x = predictors_matrix, y = target_vector, alpha = alpha)
Measure: Mean-Squared Error
Lambda Index Measure SE Nonzero
min 0.1782 100 0.6858 0.005123 3
1se 0.2147 98 0.6891 0.005021 3
R-squared on the training data: 0.8220202
The coefficient path plot shows the paths of the coefficients as the log of lambda changes: each line represents one coefficient, and its path shows how the coefficient shrinks as lambda increases. The cross-validation plot shows the curve used to select lambda: the x-axis is the log of lambda, the y-axis is the mean squared error (MSE) across the cross-validation folds, and each dot is the MSE for a specific lambda. From the model we see that the optimal lambda is 0.178. Since this value is low, little shrinkage was needed, which may indicate the model could be more complex and that slight overfitting occurs. The training R-squared of 0.822 is reasonably good, although, given the low optimal lambda, some of it may reflect overfitting. Overall, the model does a decent job of using Distance, Points Accumulated, and Points Redeemed to predict the total number of flights, but we cannot be certain of the model's validity because the assumptions of constant variance and normality were not satisfied.
This section delves into how the kNN classification model can predict customer loyalty point redemption. Airline loyalty programs are crucial for retaining customers and encouraging repeated business. Understanding patterns in how and when loyalty points are redeemed can help airlines tailor their programs to better meet customer needs, predict future redemptions, and manage capacity effectively. Using the kNN classification model, we aim to assesses the proximity of individual customer profiles to identify patterns that suggest a higher likelihood of point redemption. By analyzing key features such as travel frequency, accrued points, and redemption history, we can better target promotional efforts to enhance participation in the loyalty program.
Research Question: Which customer profile attributes are most predictive of loyalty point redemption, and how accurately can the kNN classification model forecast these point redemptions?
Link: https://mavenanalytics.io/data-playground?search=airline+loyalty+program
customer_loyalty_history <- read.csv("/Users/jitpapneja/Google Drive/TCCC Old/Personal/Airline+Loyalty+Program (1)/Customer Loyalty History.csv")
calendar <- read.csv("/Users/jitpapneja/Google Drive/TCCC Old/Personal/Airline+Loyalty+Program 2/Calendar.csv")
customer_flight_activity <- read.csv("/Users/jitpapneja/Google Drive/TCCC Old/Personal/Airline+Loyalty+Program 2/Customer Flight Activity.csv")
merged_data <- merge(customer_flight_activity, customer_loyalty_history, by = "Loyalty.Number")
# Preprocessing of data (dplyr for mutate(); lubridate for year())
library(dplyr)
library(lubridate)
merged_data <- merged_data %>%
mutate(
Country = as.factor(Country),
Province = as.factor(Province),
Gender = as.factor(Gender),
Education = as.factor(Education),
Marital.Status = as.factor(Marital.Status),
Enrollment.Tenure = year(Sys.Date()) - Enrollment.Year,
Cancellation.Year = ifelse(is.na(Cancellation.Year), 0, Cancellation.Year),
Cancellation.Month = ifelse(is.na(Cancellation.Month), 0, Cancellation.Month)
)

set.seed(123)
indices <- sample(1:nrow(merged_data), size = 0.7 * nrow(merged_data))
train_df <- merged_data[indices, ]
test_df <- merged_data[-indices, ]
train_df$PointsRedeemedBinary <- ifelse(train_df$Points.Redeemed != 0, 1, 0)
test_df$PointsRedeemedBinary <- ifelse(test_df$Points.Redeemed != 0, 1, 0)
relevant_vars <- c('Loyalty.Card', 'Total.Flights', 'Points.Accumulated', 'Dollar.Cost.Points.Redeemed', 'Salary')
complete_rows_train <- complete.cases(train_df[, relevant_vars])
train_df <- train_df[complete_rows_train, ]
complete_rows_test <- complete.cases(test_df[, relevant_vars])
test_df <- test_df[complete_rows_test, ]
train_features <- data.frame(model.matrix(~ Loyalty.Card + Total.Flights + Points.Accumulated + Dollar.Cost.Points.Redeemed + Salary - 1, data = train_df))
test_features <- data.frame(model.matrix(~ Loyalty.Card + Total.Flights + Points.Accumulated + Dollar.Cost.Points.Redeemed + Salary - 1, data = test_df))
train_labels <- train_df$PointsRedeemedBinary
test_labels <- test_df$PointsRedeemedBinary
k <- 9
knn_result <- FNN::knn(train = train_features, test = test_features, cl = train_labels, k = k)
knn_result <- factor(knn_result, levels = c("0", "1"))
test_labels <- factor(test_labels, levels = c("0", "1"))
conf_matrix <- confusionMatrix(data = knn_result, reference = test_labels)
precision <- posPredValue(conf_matrix$table, positive = "1")
recall <- sensitivity(conf_matrix$table, positive = "1")
f1_score <- (2 * precision * recall) / (precision + recall)
conf_matrix_df <- as.data.frame.matrix(conf_matrix$table)
metrics_df <- data.frame(
Metric = c("Accuracy", "Precision", "Recall", "F1 Score"),
Value = c(conf_matrix$overall['Accuracy'], precision, recall, f1_score)
)
confusion_matrix_table <- kable(conf_matrix_df, caption = "Confusion Matrix", align = 'c') %>%
kable_styling(bootstrap_options = c("striped", "hover"))
performance_metrics_table <- kable(metrics_df, caption = "Performance Metrics", align = c('l', 'c')) %>%
kable_styling(bootstrap_options = c("striped", "hover"))
cross_tab_table <- CrossTable(knn_result, test_labels, prop.chisq = FALSE, prop.t = FALSE, prop.r = FALSE) %>%
as.data.frame() %>%
kable(caption = "Cross-Tabulation", align = c('l', 'c')) %>%
kable_styling(bootstrap_options = c("striped", "hover"))

Cell Contents: N | N / Col Total
Total Observations in Table: 88229
Each cell shows the count N with its column proportion in parentheses.

| knn_result | test_labels = 0 | test_labels = 1 | Row Total |
|---|---|---|---|
| 0 | 82904 (1.000) | 4980 (0.938) | 87884 |
| 1 | 14 (0.000) | 331 (0.062) | 345 |
| Column Total | 82918 (0.940) | 5311 (0.060) | 88229 |
Confusion Matrix:

| Prediction | 0 | 1 |
|---|---|---|
| 0 | 82904 | 4980 |
| 1 | 14 | 331 |
| Metric | Value |
|---|---|
| Accuracy | 0.9433973 |
| Precision | 0.9594203 |
| Recall | 0.0623235 |
| F1 Score | 0.1170438 |
Precision: the proportion of positive identifications that were actually correct. The precision value of 0.9594 indicates a high likelihood that when the model predicts a customer will redeem points, this prediction is correct. In this context, if a promotional offer or incentive is provided based on the prediction, high precision ensures that resources are not wasted on customers who are unlikely to redeem points.
Recall: The proportion of actual positives that were identified correctly. The recall value of 0.0623 is quite low, indicating that the model identifies a relatively small proportion of all actual positive cases (customers who did redeem points). This suggests that while the predictions made by the model are usually correct, it misses a large number of customers who would redeem points but were not predicted to do so.
F1 Score: The F1 score, at 0.1170, is quite low, which highlights a problem with the balance between precision and recall in the model. The F1 score is the harmonic mean of precision and recall. A low F1 score in the presence of a high precision and low recall indicates that the cost of false negatives (failing to identify customers who would redeem points) significantly impacts the model’s performance.
In conclusion, the model appears to be rather conservative in predicting point redemption, focusing on being correct when it does make such a prediction (high precision), but at the expense of failing to identify a substantial number of customers who do redeem points (low recall). In this context, this may mean that while the promotions might be efficiently allocated, there are missed opportunities in reaching customers who would engage with the loyalty program if prompted.
The accuracy of the kNN model is 0.9434, indicating that approximately 94.34% of its predictions are correct. The high accuracy is reassuring, but it must be read alongside the precision, recall, and F1 score. In summary, the accuracy tells us the model is overall reliable in its predictions, but it is extremely selective, which results in missed opportunities to identify many customers who would have redeemed points. This selective approach is efficient in not wasting resources but not effective in engaging all potential customers.
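The low recall suggests the default majority vote is too strict given how rare redemption is. One way to probe this, not part of the original analysis, is to recover the neighbor vote shares from FNN::knn and lower the decision threshold for the "redeem" class; the 2-of-9 threshold below is an illustrative assumption:

```r
# Sketch: re-run kNN keeping vote proportions, then loosen the threshold
knn_votes <- FNN::knn(train = train_features, test = test_features,
                      cl = train_labels, k = 9, prob = TRUE)
win_share <- attr(knn_votes, "prob")                      # votes for winning class
p_redeem  <- ifelse(knn_votes == "1", win_share, 1 - win_share)

# Predict redemption whenever at least 2 of the 9 neighbors redeemed
pred_loose <- factor(ifelse(p_redeem >= 2/9, "1", "0"), levels = c("0", "1"))
caret::sensitivity(table(pred_loose, test_labels), positive = "1")  # recall
```

Raising recall this way trades away some precision, so the threshold should be chosen against the business cost of each error type.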
The dataset provided contains detailed information on the population of electric vehicles (EVs) registered within a specific geographic region, evidently the state of Washington, USA, given the frequent appearance of cities and counties from that state. It comprises 177,866 records, each representing a unique electric vehicle identified by a partial Vehicle Identification Number (VIN) in the "VIN (1-10)" column. The dataset encompasses a range of variables including geographic information (County, City, State, Postal Code, Vehicle Location, 2020 Census Tract), vehicle specifics (Model Year, Make, Model, Electric Vehicle Type, Electric Range, Base MSRP), registration details (DOL Vehicle ID), legislative context (Legislative District), and utility information (Electric Utility). Its timeframe spans vehicles manufactured from 1997 up to the 2024 model year, showcasing the growth and development of the electric vehicle market over nearly three decades.
Analysis opportunities range from examining EV distribution and density to assessing trends in adoption over time and exploring the relationship between EV usage and legislative or utility districts. This dataset serves as a valuable resource for informing stakeholders, including policymakers, manufacturers, and energy providers, about the current state and future prospects of EV adoption and infrastructure development.
The secondary dataset was chosen due to compatibility issues encountered with the airline loyalty dataset. MLR (Multiple Linear Regression), logistic regression and loess functions were not suitable for the airline dataset, resulting in low-quality outputs with an R-squared value of 0.05 or lower. Despite attempting various combinations and transformations, the desired results could not be achieved. Therefore, an alternative dataset was sought where MLR and loess analyses could be conducted effectively.
| Attribute | Description |
|---|---|
| VIN (1-10) | Vehicle identification number first 10 characters. |
| County | Administrative division where the vehicle is registered. |
| City | Urban settlement where the vehicle owner resides. |
| State | US state where the vehicle is registered. |
| Postal.Code | ZIP code for the vehicle owner’s address. |
| Model.Year | Year of manufacture of the vehicle. |
| Make | Manufacturer or brand of the vehicle. |
| Model | Specific model of the electric vehicle. |
| Electric.Vehicle.Type | Classification of electric vehicle, such as Battery Electric Vehicle (BEV). |
| Clean.Alternative.Fuel.Vehicle | Indicates whether the vehicle qualifies as a Clean Alternative Fuel Vehicle. |
| Electric.Range | Maximum distance the vehicle can travel on a single charge. |
| Base.MSRP | Manufacturer’s suggested retail price without any additional features. |
| Legislative.District | Legislative district where the vehicle is registered. |
| DOL.Vehicle.ID | Department of Licensing’s unique identifier for the vehicle. |
| Vehicle.Location | Geographic coordinates of the vehicle’s registered location. |
| Electric.Utility | Utility provider supplying electricity to the vehicle’s location. |
| 2020.Census.Tract | Census tract where the vehicle owner resides, based on the 2020 Census. |
In addition to Multiple Linear Regression (MLR), another powerful technique for analyzing the relationship between variables is LOESS (local polynomial regression). LOESS is a non-parametric regression method that allows for more flexible modeling of complex relationships between variables: it fits a smooth curve to the data by locally estimating a polynomial regression model.
Using the Electric Vehicle dataset, LOESS could be utilized to explore the potential non-linear relationships between predictors (such as Model Year, MSRP, or Legislative District) and the Electric Range of EVs.
Research Question: “What is the relationship between the Model Year and Electric Range of electric vehicles (EVs), and how does it vary over time?”
library(tidyverse)
library(ggplot2)
df <- read_csv("/Users/jitpapneja/Google Drive/TCCC Old/Personal/Electric_Vehicle_Population_Data (1).csv")
# Data preprocessing: Impute missing values and filter
missing_values <- sum(is.na(df))
numeric_cols <- sapply(df, is.numeric)
df[, numeric_cols] <- lapply(df[, numeric_cols], function(x) ifelse(is.na(x), mean(x, na.rm = TRUE), x))
categorical_cols <- sapply(df, is.character)
df[, categorical_cols] <- lapply(df[, categorical_cols], function(x) ifelse(is.na(x), "Unknown", x))
df <- df[df$Electric.Range > 0, ]
##sampling
library(dplyr)
sample_size <- 40000
sampled_df <- df %>% sample_n(sample_size)

Analysis: Based on the plots, there seems to be a general increasing trend in electric range starting from the earliest model year shown until around 2012, where it reaches a peak. Around 2020, the electric range hits its maximum value on the smooth curve, suggesting that models from around that year had the highest electric range on average. After the peak, there is a sharp decline in the electric range, which could suggest a change in production models, a shift in manufacturer priorities, or possibly the introduction of many new models with lower ranges. There is considerable variability in the data points, especially in the later years, where the electric range varies widely across different models. There is also an indication of potential outliers, as some points deviate significantly from the majority of the data, especially in the later years; these could be outliers or special types of vehicles such as high-performance models.
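The plotting code that produced the LOESS curves discussed below is not shown; a minimal sketch with ggplot2, where the spans 0.25, 0.50, and 0.75 mirror the discussion (the point transparency and axis labels are assumptions):

```r
library(ggplot2)

# LOESS fit of Electric Range on Model Year at three smoothing spans
for (s in c(0.25, 0.5, 0.75)) {
  p <- ggplot(sampled_df, aes(x = Model.Year, y = Electric.Range)) +
    geom_point(alpha = 0.1) +
    geom_smooth(method = "loess", span = s, se = TRUE) +
    labs(title = paste("LOESS fit, span =", s),
         x = "Model Year", y = "Electric Range (miles)")
  print(p)
}
```

A smaller span fits each local regression to fewer points, which is exactly why the 0.25 curve tracks short-term fluctuations while the 0.75 curve shows the broad trend.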
Model Fit: The LOESS curve fits the data points in the middle range well. But the fit is less accurate at the extreme ends, particularly in the latest years, where there is more scatter away from the curve. The curve appears to have a moderate span, as it smooths out much of the variability without overfitting. It captures the overall trend without following every small variation in the data points.
Different Spans: As expected, the plot with a span of 0.75 has the smoothest curve, since each local regression takes a larger fraction of points into account, producing a more general trend with less sensitivity to local fluctuations. A span of 0.25 produces the least smooth curve, as it considers a smaller fraction of points, allowing the curve to follow local variations more closely. The plot with the higher span highlights broader trends in the data over time, such as a general increase or decrease in electric range, while the plot with the lower span shows more of the short-term fluctuations, which can matter for understanding yearly or model-specific changes in electric range.
In summary, the plot indicates an improvement in electric vehicle range up to around 2020, with a subsequent unexpected decline. It also highlights the variability in electric ranges across models, especially in more recent years. Therefore, the plots do answer the research question by illustrating the nature of the relationship between Model Year and Electric Range and demonstrating how this relationship has changed over time. The LOESS curve effectively captures the central trend despite the individual variabilities. And the dispersion of points around the curve in recent years highlights that the relationship is not strictly linear and may be influenced by other factors not captured.
The MLR (Multiple Linear Regression) Dashboard is designed to provide comprehensive insights into the factors influencing electric vehicle (EV) adoption within Washington state, USA. This tool utilizes a robust dataset comprising 177,866 unique records spanning EVs manufactured from 1997 to 2024, encompassing variables such as geographic data, vehicle specifics, registration details, legislative context, and utility information.
Independent Variables: The model considers a range of predictors: Model Year, Make, Model, Electric Vehicle Type, Clean Alternative Fuel Vehicle eligibility, Base MSRP, and Legislative District.
Dependent Variable: The dependent variable is the Electric Range of the EVs, indicating the distance an EV can travel on a single charge.
Data Preparation: Prior to analysis, categorical predictors are appropriately transformed using factorization. Outliers and missing values are addressed to ensure the integrity of the dataset.
Research Question: What are the significant factors influencing the Electric Range of electric vehicles (EVs), and how well can these factors explain the variation in Electric Range?
Model Building: The MLR model is constructed using the lm() function in R, with the Electric Range as the dependent variable and the predictors as independent variables. The model aims to show the relationship between these predictors and the Electric Range of EVs.
Model Evaluation: Performance metrics such as Mean Squared Error (MSE) and R-squared (R^2) are computed to assess the accuracy and goodness-of-fit of the model.
Predictive Analytics: The trained MLR model is used to make predictions on unseen data, allowing stakeholders to predict Electric Range based on various vehicle characteristics and legislative contexts.
Model Refinement and Selection: Fine-tune the model with techniques such as removing outliers, stepwise model selection, and log transformation to improve its predictive power. Finally, we select the final MLR model based on its performance metrics and interpret the coefficients to understand the impact of each feature on the target variable.
df <- read_csv("/Users/jitpapneja/Google Drive/TCCC Old/Personal/Electric_Vehicle_Population_Data (1).csv")
missing_values <- sum(is.na(df))
numeric_cols <- sapply(df, is.numeric)
df[, numeric_cols] <- lapply(df[, numeric_cols], function(x) ifelse(is.na(x), mean(x, na.rm = TRUE), x))
categorical_cols <- sapply(df, is.character)
df[, categorical_cols] <- lapply(df[, categorical_cols], function(x) ifelse(is.na(x), "Unknown", x))
df <- df[df$Electric.Range > 0, ]

target <- 'Electric.Range'
predictors <- c('Model.Year', 'Make', 'Model', 'Electric.Vehicle.Type', 'Clean.Alternative.Fuel.Vehicle', 'Base.MSRP', 'Legislative.District')
df[predictors] <- lapply(df[predictors], factor)
df_cleaned <- df[-c(18916, 62033, 66313), ]
set.seed(123) # For reproducibility
training_index <- createDataPartition(df_cleaned[[target]], p = 0.8, list = FALSE)
train_data <- df_cleaned[training_index, ]
test_data <- df_cleaned[-training_index, ]
formula <- as.formula(paste(target, '~', paste(predictors, collapse = ' + ')))
mlr_model <- lm(formula, data = train_data)
summary_mlr <- summary(mlr_model)
predictions <- predict(mlr_model, test_data)
mse <- mean((test_data[[target]] - predictions)^2)
r2 <- summary_mlr$r.squared
cat("MSE: ", mse)

MSE: 281.423
R^2: 0.9720977
lm(formula = formula, data = train_data)
Below is a snippet of the coefficients from the model.
| Term | Estimate | Std. Error | t value | Pr(>\|t\|) |
|---|---|---|---|---|
| (Intercept) | 6.078469 | 16.950329 | 0.359 | 0.719892 |
| Model.Year1998 | 13.203620 | 23.492188 | 0.562 | 0.574089 |
| Model.Year1999 | 29.386572 | 18.205588 | 1.614 | 0.106499 |
| Model.Year2000 | 13.503411 | 18.204538 | 0.742 | 0.458235 |
| … | … | … | … | … |
Key Findings:
Coefficient Estimates: The coefficients represent the estimated change in Electric Range for a one-unit change in each predictor variable, holding all other variables constant. For example, a coefficient of 13.20 for Model Year 1998 suggests that, on average, EVs manufactured in 1998 have an Electric Range approximately 13.20 units higher than the reference category.
Model Fit: The R-squared value of 0.9721 indicates that approximately 97.21% of the variance in Electric Range is explained by the predictors included in the model. This suggests a strong overall fit, with the selected predictors collectively providing substantial explanatory power for Electric Range variation in EVs.
Both influence checks returned `integer(0)`, i.e., an empty index vector: no observations exceed the thresholds described below.
Key Findings:

Assuming the thresholds below:

- Cook's distance: a common threshold is 1; values exceeding 1 suggest potentially influential points.
- Leverage values: leverage ranges from 0 to 1, with higher values indicating higher leverage. A common threshold for high-leverage points is 2 * (p + 1) / n, where p is the number of predictors and n is the sample size.
- Studentized residuals: a common threshold for flagging outliers is ±3; observations with absolute studentized residuals exceeding 3 are considered potential outliers.
Residuals vs Fitted Plot: The spread of residuals appears to have non-constant variance, with a pattern that could suggest a non-linear relationship. There are also a few outliers that could affect the model's accuracy.
Q-Q Plot: The plot does not perfectly follow the normal distribution. There is a deviation from the line in the tails, especially the upper tail, indicating the presence of outliers or extreme values.
Scale-Location Plot: This plot shows that the spread of residuals is not uniform across the range of fitted values, which could imply that the variance of the residuals is not constant. There are also some points with higher spread at the higher end of fitted values.
Residuals vs Leverage Plot: The Cook’s distance lines suggest that there are points that are potential influencers.
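These diagnostics can be computed directly from the fitted lm object; below is a minimal sketch, assuming the `mlr_model` and `train_data` objects from the code above:

```r
# Sketch: flag candidate outliers / influential points using the
# thresholds described above.
cooks <- cooks.distance(mlr_model)   # influence of each observation
lev   <- hatvalues(mlr_model)        # leverage values
stud  <- rstudent(mlr_model)         # studentized residuals

p <- length(coef(mlr_model)) - 1     # number of predictor coefficients
n <- nrow(train_data)                # sample size

which(cooks > 1)                     # Cook's distance threshold of 1
which(lev > 2 * (p + 1) / n)         # high-leverage threshold 2(p+1)/n
which(abs(stud) > 3)                 # studentized-residual threshold of ±3
```

An empty result (`integer(0)`) from any of these checks means no observation crosses that threshold.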
# Removing outliers: `outliers` is assumed to hold the row indices flagged
# by the studentized-residual check (|rstudent| > 3)
train_data_clean <- train_data[-outliers, ]
mlr_model_clean <- lm(formula, data = train_data_clean)
summary_obj <- summary(mlr_model_clean)
f_value <- summary_obj$fstatistic["value"]
f_df1 <- summary_obj$fstatistic["numdf"]
f_df2 <- summary_obj$fstatistic["dendf"]
p_value <- pf(f_value, f_df1, f_df2, lower.tail = FALSE)
model_stats_clean_df <- data.frame(
Metric = c("Residual Standard Error", "Degrees of Freedom", "R-squared", "Adjusted R-squared", "F-statistic", "p-value"),
Value = c(
sprintf("%.2f on %d degrees of freedom", sigma(mlr_model_clean), df.residual(mlr_model_clean)),
NA,
sprintf("%.4f", summary(mlr_model_clean)$r.squared),
sprintf("%.4f", summary(mlr_model_clean)$adj.r.squared),
sprintf("%.2e on %d and %d DF", f_value, f_df1, f_df2),
format.pval(p_value, digits = 2)
)
)
model_stats_clean_df <- na.omit(model_stats_clean_df)
library(knitr)
kable(model_stats_clean_df, format = "markdown", col.names = c("Metric", "Value"), align = 'l')

| Metric | Value |
|---|---|
| Residual Standard Error | 14.30 on 67665 degrees of freedom |
| R-squared | 0.9793 |
| Adjusted R-squared | 0.9793 |
| F-statistic | 1.74e+04 on 184 and 67665 DF |
| p-value | <2e-16 |
Key Findings: Given the large number of outliers, we examined how the model changes after removing them. We do not see much of a change apart from some sparseness in the data points, though there are a few minor changes in the metrics.
Comparing the two model fits, we can observe the following:
Coefficients: Expanding summary(mlr_model_clean) shows that many of the coefficients that were NA in the original fit are no longer present after the outliers were removed.
R-squared: The previous model fit had an R-squared value of 0.9721. In the current model fit, after removing outliers, the R-squared value increased to 0.9793, indicating that approximately 97.93% of the variability is now explained by the independent variables. This suggests a slight improvement in the model’s ability to explain the variation in the dependent variable.
Adjusted R-squared: The adjusted R-squared closely tracks the R-squared in both fits. In the previous model fit it was 0.972, and in the current fit it is 0.9793. The small gap between the adjusted and unadjusted values suggests the model is not being penalized for superfluous predictors.
F-statistic: The F-statistic assesses the overall significance of the model. In the previous model fit, the F-statistic was 1.768e+04, while in the current model fit, it is slightly lower at 1.74e+04.
Overall, the current model fit, after removing outliers, shows a slightly higher R-squared value, indicating a better fit to the data. The adjusted R-squared, F-statistic, and p-value remain similar between the two model fits, suggesting a consistent and significant fit. From the model output, it is evident that there are several outliers present in the data. However, upon examining the results, no high leverage or influential points meet the specified threshold criteria. And the plots have not changed much either.
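The comparison above can also be tabulated programmatically; a small sketch using the two fitted models from earlier:

```r
# Sketch: side-by-side fit statistics for the original and cleaned models.
comparison <- data.frame(
  Model  = c("Original", "Outliers removed"),
  R2     = c(summary(mlr_model)$r.squared,     summary(mlr_model_clean)$r.squared),
  Adj_R2 = c(summary(mlr_model)$adj.r.squared, summary(mlr_model_clean)$adj.r.squared),
  RSE    = c(sigma(mlr_model),                 sigma(mlr_model_clean))
)
comparison
```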
Note: We tried to check for multicollinearity using the VIF, but the function raised an error about aliased coefficients. To investigate, we looked for the aliased pairs, but alias() reports none. We assume this is due to near-perfect multicollinearity or very high correlation among some of the independent variables, approaching the numerical tolerance used by the VIF function.
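The check described in the note can be reproduced roughly as follows (a sketch; it assumes the car package, which provides vif()):

```r
# Sketch: multicollinearity check that triggered the aliasing error.
library(car)  # provides vif()

# vif() stops with an error about aliased coefficients when some dummy
# columns are exact linear combinations of others; capture the message.
vif_result <- tryCatch(vif(mlr_model), error = function(e) conditionMessage(e))

# alias() lists exact linear dependencies among coefficients; here it
# reported none, pointing to near-perfect (not exact) collinearity.
alias(mlr_model)
```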
# More data cleaning along the way: identify sparse levels of the Model
# variable and filter df to exclude the rows corresponding to those levels.
model_distribution <- table(df$Model)  # level counts (assumed definition)
sparse_levels <- names(model_distribution[model_distribution <= 1])
df$Model_Combined <- ifelse(df$Model %in% sparse_levels, 'Other', as.character(df$Model))
df$Model_Combined <- factor(df$Model_Combined)
df <- df[!df$Model %in% sparse_levels, ]
#StepWise
library(MASS)
# Initial model fitting
initial_formula <- as.formula(paste(target, "~", paste(predictors, collapse = " + ")))
initial_model <- lm(initial_formula, data = train_data)
step_model <- stepAIC(initial_model, direction = "both")

Output (abridged):

Start:  AIC=386320.7
Electric.Range ~ Model.Year + Make + Model + Electric.Vehicle.Type + Clean.Alternative.Fuel.Vehicle + Base.MSRP + Legislative.District

Step:  AIC=386320.7
Electric.Range ~ Model.Year + Model + Electric.Vehicle.Type + Clean.Alternative.Fuel.Vehicle + Base.MSRP + Legislative.District

Step:  AIC=386317
Electric.Range ~ Model.Year + Model + Electric.Vehicle.Type + Clean.Alternative.Fuel.Vehicle + Base.MSRP
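Once stepAIC() finishes, the retained predictors can be confirmed directly from the returned object (a small sketch using the step_model object from above):

```r
# Sketch: inspect the model selected by the stepwise search.
formula(step_model)                       # final formula after both-direction stepAIC
attr(terms(step_model), "term.labels")    # retained predictor names
AIC(step_model)                           # AIC of the selected model
```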
new_predictors <- c("Model.Year", "Make", "Model", "Electric.Vehicle.Type", "Base.MSRP")
to_remove <- c(18910, 24399, 62007, 62619, 64883, 66291)
train_data_cleaned <- train_data[-to_remove, ]
# Fit on the log scale; the log() in the formula handles the transformation,
# so the response column itself is left untransformed (transforming it in
# place as well would apply the log twice).
new_formula <- as.formula(paste("log(", target, ") ~", paste(new_predictors, collapse = ' + ')))
final_model_cleaned <- lm(formula = new_formula, data = train_data_cleaned)
plot(final_model_cleaned)

Call: lm(formula = new_formula, data = train_data_cleaned)
Below is a snippet of the coefficients from the model.
| Term | Estimate | Std. Error | t value | p-value |
|---|---|---|---|---|
| (Intercept) | 6.078469 | 16.950329 | 0.359 | 0.719892 |
| Model.Year1998 | 13.203620 | 23.492188 | 0.562 | 0.574089 |
| Model.Year1999 | 29.386572 | 18.205588 | 1.614 | 0.106499 |
| Model.Year2000 | 13.503411 | 18.204538 | 0.742 | 0.458235 |
| … | … | … | … | … |
Residual standard error: 0.02589 on 68547 degrees of freedom
Key Findings:
The new model includes predictors such as Model.Year, Model, Electric.Vehicle.Type, and Base.MSRP. The coefficients for these predictors suggest varying levels of influence on the dependent variable Electric.Range. For instance, certain model years show a negative association, implying that electric range decreases compared to the baseline year. In contrast, other years, like Model.Year2020, show a significant positive effect, indicating a higher electric range. The Electric.Vehicle.TypePlug-in Hybrid Electric Vehicle (PHEV) predictor has a substantial negative coefficient, which means that compared to all-electric vehicles, plug-in hybrids have a significantly lower electric range. The Base.MSRP variable has a varied effect on the electric range. Certain MSRP values are associated with a significant increase or decrease in the electric range, indicating that price may correlate with range capabilities.
The model has a very high R-squared value of 0.99, which implies that approximately 99.0% of the variability in the electric range can be explained by the model's predictors. The adjusted R-squared is nearly as high at 0.989, showing that the model fits the data well and that the included predictors contribute to its explanatory power. The F-statistic is also extremely large with a practically zero p-value, providing strong evidence against the null hypothesis that no predictor is related to the response; the model is statistically significant. However, despite the high R-squared values, the model assumptions are not fully met, as discussed below.
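Since the response is modeled on the log scale, each coefficient is a multiplicative effect on Electric.Range rather than an additive one. A small sketch of converting the coefficients to approximate percentage changes (using the final_model_cleaned object from above):

```r
# Sketch: express log-scale coefficients as % changes in Electric.Range
# relative to each factor's reference level.
coefs <- coef(final_model_cleaned)
pct_change <- (exp(coefs) - 1) * 100   # exp(beta) - 1 is the proportional change
head(round(pct_change, 2))
```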
Checking Assumptions:
The Residuals vs Leverage plot: There are a few points outside the Cook's distance lines, which suggests the presence of influential observations. These points could unduly influence the model's predictions.
The Q-Q plot: Most points follow the line, but there is a clear deviation at both tails, indicating that the residuals have heavier tails than a normal distribution. This suggests that the normality assumption may be violated.
The Scale-Location plot: The plot shows a somewhat uniform spread of residuals across the range of fitted values with a slightly increasing trend, suggesting that variance may increase with the fitted values and it may indicate potential non-constant variance.
The Residuals vs Fitted Plot: The residuals are centered around the horizontal line at zero, which is good. There may be a slight pattern suggesting some non-constant variance (heteroscedasticity), and some outliers remain, but since the red line is relatively straight, the linearity assumption appears roughly satisfied.
The model shows some indications of potential violations of the key assumptions, but after the many changes made to the data, there is little more that can be done; the summary still shows a robust model with a strong fit. The presence of many outliers, however, remains a concern for the validity of model inferences.